Linking Named Entities



Named Entities and Linked Data

The named entities we have recognised in the Henslow data would be much more useful if they could be linked to other data known about those entities. This principle is called linked data. Linked data can enrich the discovery of collections and allow sophisticated searches for the knowledge within those collections.

If the data is freely available and openly licensed it is known as linked open data (LOD). The diagram below shows the extent of LOD in 2010. Since then, the linked open data cloud has grown immensely and you can explore it for yourself at www.lod-cloud.net.

Linked open data cloud in 2010

Linked data is a very big topic, so this notebook will only touch on a few introductory aspects that relate to the NER we have done in this course. In particular, we will focus on the automated ways of linking data that can be enabled by writing code, though the underlying principles can be understood without it.



Disambiguate with an Authority File

One of the challenges with named entities is that there may be many different forms, spellings or abbreviations that refer to the same person, place, country, and so on.

An authority file is a way of normalising and unifying this information for each entity into a single authority record and giving it a unique identifier. Typically, all the known forms of a particular entity will be recorded in its authority record so that every form can be resolved to the same, correct entity.

You may already be familiar with VIAF: The Virtual International Authority File, which is an authority service that unifies multiple authority files for people, geographic places, works of art, and more.

VIAF: The Virtual International Authority File

If you simply search for a name in the search box, VIAF returns a VIAF ID, preferred and related names, and associated works.

assets/viaf-charles-darwin.png



Lookup Entities Programmatically with Web APIs

The power of centralised authorities such as VIAF comes when their data is exposed via an API (Application Programming Interface). A web API is accessed via a particular web address and allows computer programs to request information and receive it in a structured format suitable for further processing. Typically, this data will be provided in either JSON or XML.

VIAF has several different APIs, which we as humans can explore using the OCLC API Explorer.

EXERCISE: Click on the link above and then on the link "Auto Suggest". Modify the example query to search for "john stevens henslow" or another personal name from the Henslow letters that you can recall.

You should get something like this:

assets/viaf-api-charles-darwin.png

It has returned a list of results, in JSON format, with VIAF's suggestions for the best match, which you can see in the right-hand "Response" pane.

We can consume this data programmatically using Python tools with which we are already familiar from earlier notebooks.
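A minimal sketch of such a lookup, using only the standard library (the AutoSuggest endpoint URL follows VIAF's public API documentation; treat the details as an assumption rather than the notebook's original code):

```python
import json
import urllib.parse
import urllib.request

def viaf_suggest(name):
    """Query VIAF's AutoSuggest service and return the parsed JSON payload."""
    url = "https://viaf.org/viaf/AutoSuggest?query=" + urllib.parse.quote(name)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def top_viafid(payload):
    """Pull the VIAF ID of the first (best-match) suggestion, or None."""
    results = payload.get("result") or []
    return results[0].get("viafid") if results else None

# Live usage (requires a network connection):
#   payload = viaf_suggest("john stevens henslow")
#   print(top_viafid(payload))
```

The `requests` library would work just as well; `urllib` is used here only to keep the sketch dependency-free.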

If you compare this with the output of the API explorer above, you should see the same structure and information.

The VIAF ID is found in the 'viafid' field:

With this information we could now enrich the original XML with the VIAF ID for this named entity.

EXERCISE: What are the problems you could anticipate with this sort of automated linking?



Lookup Named Entities in Bulk using Web APIs

The scientific community has been busy normalising, disambiguating, and aggregating similar types of data for decades, in a movement parallel to, but largely separate from, developments in library science and the humanities.

The Global Biodiversity Information Facility (GBIF) is an international open data aggregator for hundreds of millions of species records.

Global Biodiversity Information Facility

In the last notebook we tried to add a new named entity type TAXONOMY for the model to learn. We defined this as a type of entity for any Linnaean taxonomic name (domain, kingdom, phylum, division, class, order, family, genus or species). Binomials (genus plus species together) were labelled as one span.

Imagine if we wished to link these named taxonomic entities to the corresponding genus or species in the GBIF. How could we do this en masse?

Like VIAF, GBIF also has a set of web APIs and we can use the Species API to search for species names.

EXERCISE: Reading API documentation is a common activity for coders. Before you look at the code example below, open the Species API documentation, scroll down to the 'Searching Names' section and see if you can work out which of the four resource URLs would be most useful for our case.

Lungwort: Flowers of Pulmonaria officinalis

Flowers of Pulmonaria officinalis

Let's start by trying one taxonomic (genus) name "Pulmonaria" to see what sort of result we can expect:
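A sketch of the call against the GBIF Species API's suggest endpoint (the URL and the field names `key`, `scientificName` and `rank` are taken from the GBIF documentation, not from the original notebook code):

```python
import json
import urllib.parse
import urllib.request

def gbif_suggest(name):
    """Query the GBIF species suggest endpoint; returns a list of candidate taxa."""
    url = "https://api.gbif.org/v1/species/suggest?" + urllib.parse.urlencode({"q": name})
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def summarise(taxa):
    """Reduce each candidate to its GBIF key, scientific name and rank."""
    return [(t.get("key"), t.get("scientificName"), t.get("rank")) for t in taxa]

# Live usage (network required):
#   for key, name, rank in summarise(gbif_suggest("Pulmonaria"))[:5]:
#       print(key, name, rank)
```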

So far, so very similar to the VIAF API.

Reconciling Historical Taxa

In reality, we need to be aware that some of the older names given to organisms in the Henslow letters are not easily reconciled with modern named taxa. (In the Darwin Correspondence Project (DCP), Shelley Innes, editor and research associate, is an expert in historical taxonomy and her work is available in the footnotes of the published DCP letters.)

Also, the Henslow letters often use ligature ash ('æ') rather than 'ae', which is used in family names in GBIF. The GBIF suggest API does not recognise 'æ' and 'ae' as equivalent, so either our queries will need to be normalised, or we can try a different API.
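The normalisation itself is a one-liner; a minimal helper covering just the ligature substitution described above:

```python
def normalise_taxon(name):
    """Replace the ash ligature with its two-letter equivalent so that
    historical spellings match the forms GBIF uses in family names."""
    return name.replace("æ", "ae").replace("Æ", "Ae")

print(normalise_taxon("Cynipidæ"))  # Cynipidae
```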

Gall Wasp - Cynipidae family, Leesylvania State Park, Woodbridge, Virginia

Gall Wasp - Cynipidae family

If there is no matchable name in GBIF we get an empty result:

But if we try the search API instead there is no problem:
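A sketch of the full-text search call (the `/v1/species/search` endpoint and its `results` field follow the GBIF documentation; the helper is an illustrative addition):

```python
import json
import urllib.parse
import urllib.request

def gbif_search(name):
    """Query GBIF's full-text species search, which copes better with
    variant spellings than the suggest endpoint. Returns the result list."""
    url = "https://api.gbif.org/v1/species/search?" + urllib.parse.urlencode({"q": name})
    with urllib.request.urlopen(url) as response:
        return json.load(response).get("results", [])

def first_key(results):
    """GBIF key of the first search result, or None if there was no match."""
    return results[0].get("key") if results else None

# Live usage (network required):
#   print(first_key(gbif_search("Cynipidæ")))
```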

Let's now take the list of taxonomic names from the previous notebook, cleaned up and normalised, and try to make a query with the whole list:
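A generic sketch of the bulk lookup (the `fetch` argument stands in for any single-name query function, such as one wrapping the GBIF search endpoint; the delay value is illustrative):

```python
import time

def lookup_all(names, fetch, delay=0.1):
    """Look up each name in turn with `fetch` (a function returning a list
    of matches) and record the first match, or None if there was none.
    One HTTP request is made per name, so a long list takes a while;
    the short sleep keeps us polite to the API."""
    matches = {}
    for name in names:
        results = fetch(name)
        matches[name] = results[0] if results else None
        time.sleep(delay)
    return matches
```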

Why do you think it took so long? How can you tell if no match was found?

We now have all sorts of exciting information about these species names. Try some of the entity names to see if the search got the correct match.



Named Entities and Knowledge Bases

A knowledge base is a system that stores facts and in some way links them with one another into a store of information that can be queried for new knowledge. A knowledge base may store semantic information with triples to create a knowledge graph where entities (nodes) are linked to other entities (nodes) by relationships (edges).

Formally, a triple is made up of subject, predicate and object. For example:

"Odysseus" (subject) -> "is married to" (predicate) -> "Penelope" (object)

Many triples together form a graph:

Knowledge graph

Each entity is represented by a URI, which is unique and identifies it unambiguously.
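Before moving to a real knowledge base, the triple idea can be illustrated in a few lines of plain Python (a toy example with invented facts, not a real triplestore):

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("Odysseus", "is married to", "Penelope"),
    ("Odysseus", "is king of", "Ithaca"),
    ("Telemachus", "is son of", "Odysseus"),
]

def objects_of(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("Odysseus", "is married to"))  # ['Penelope']
```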

Perhaps the most well-known knowledge base is Wikidata, which is collaborative (relies on data donations and user editing) and open (all the data is openly licensed for re-use).

You can get an idea of the vast store of data and query possibilities by using the Wikidata Query Service.

EXERCISE: Try some of the 'Examples' queries from the Wikidata Query Service. Notice that some queries come with visualisations. Why do you think it takes so long for some of the queries to complete?

Find Named Entities in a Knowledge Base

To interact with Wikidata's knowledge base programmatically, we must use a W3C-standard query language called SPARQL (SPARQL Protocol And RDF Query Language).

You can see the SPARQL queries in the Wikidata Query Service examples. They look like this:

#Map of hospitals
#added 2017-08
#defaultView:Map
SELECT * WHERE {
  ?item wdt:P31/wdt:P279* wd:Q16917;
        wdt:P625 ?geo .
}

Unfortunately, SPARQL has a demanding learning curve, but fortunately there are a number of tools for programmers that can make our lives easier.

We are going to use a Python package called wptools to make querying Wikidata as easy as writing simple Python. wptools actually uses the MediaWiki API, which is cheating, or a good idea to avoid SPARQL, or both. 😆

First, let's try a simple string query:
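A sketch of how this looks with wptools (the `silent` flag and the `page.data` keys are assumptions based on wptools' documented interface, not the notebook's original cell):

```python
# With wptools installed (pip install wptools), the lookup is one call:
#
#   import wptools
#   page = wptools.page("Lobelia urens", silent=True)
#   page.get_wikidata()
#
# page.data is a plain dict; a small helper pulls out the two fields
# discussed below:

def wikidata_summary(data):
    """Return (Wikidata ID, Wikidata URL) from a wptools page.data dict."""
    return data.get("wikibase"), data.get("wikidata_url")

# e.g. wikidata_summary(page.data)
```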

The ID that is printed out, 'Q3257667', is the unique Wikidata ID, and the wikidata_url goes directly to the plant's unique URI.

EXERCISE: Try the Wikidata URL now and examine all the information that Wikidata knows about Lobelia urens. Notice in particular that it has a link to the GBIF ID '5408353'.

We can even get the plant's picture programmatically!
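A sketch of the image fetch (wptools attaches images to `page.data['image']` as a list of dicts; the `url` key is an assumption from its documentation):

```python
import urllib.request

def first_image_url(data):
    """URL of the first image wptools attached to the page, or None."""
    images = data.get("image") or []
    return images[0].get("url") if images else None

def download(url, filename):
    """Save the image locally (network required)."""
    urllib.request.urlretrieve(url, filename)

# Live usage, with `page` from the wptools lookup described above:
#   url = first_image_url(page.data)
#   if url:
#       download(url, "lobelia.jpg")
```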

EXERCISE: Try searching Wikidata for some of the other taxonomic names and fetching their pictures. What happens if the search is unsuccessful?

Since Wikidata already has a link to the GBIF ID that we have from before, can we query Wikidata directly with the GBIF ID and get the knowledge base information that way?

The answer is yes! But we will have to make a small dive into the world of SPARQL...

Make Simple SPARQL Queries

Rather than use the Wikidata Query Service like a human, we're going to interact with the SPARQL endpoint programmatically. An endpoint is the URL where you send a query for a particular web service. For the curious, here is a big list of known SPARQL endpoints.

Wikidata is the top entry on that list! But the endpoint listed is a bit out of date. We are going to use the main Wikidata SPARQL endpoint at: https://query.wikidata.org/sparql

We're going to use a different Python library called SPARQLWrapper to make the query.
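A sketch along these lines (P846 is Wikidata's property for GBIF taxon IDs; the user-agent string is a placeholder you should replace with your own):

```python
# The query: select everything about any entity whose GBIF taxon ID
# (property P846) is "5408353".
QUERY = """
SELECT * WHERE {
  ?item wdt:P846 "5408353" .
}
"""

def item_uris(results):
    """Extract the entity URIs from a SPARQL JSON result set."""
    return [b["item"]["value"] for b in results["results"]["bindings"]]

# Live usage (pip install sparqlwrapper; network required):
#   from SPARQLWrapper import SPARQLWrapper, JSON
#   sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
#                          agent="example-notebook/0.1")  # placeholder agent
#   sparql.setQuery(QUERY)
#   sparql.setReturnFormat(JSON)
#   print(item_uris(sparql.query().convert()))
```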

If you now cut and paste the URL that has been returned you should find yourself looking once again at the Lobelia entity. So far so good.

Let's take a moment to understand a bit more about the SPARQL query we just made:

So, overall, the query says "select all information about any entity that has a GBIF ID property of 5408353".

You can read more about Wikidata Identifiers like "P846" and Wikidata prefixes like "wdt:".

Now let's try something a bit more sophisticated, by asking for some additional information available in Wikidata:
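A plausible version of such an extended query (P18 is Wikidata's image property, and the label service fills in `?itemLabel`; this is a reconstruction, not the notebook's original query):

```python
# As before, match on the GBIF taxon ID (P846), but also ask for the
# English label and, where one exists, an image (P18).
EXTENDED_QUERY = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P846 "5408353" .
  OPTIONAL { ?item wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
```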

If SPARQL takes your interest, and you'd like to learn more about linked open data, I can recommend the Programming Historian's Introduction to the Principles of Linked Open Data and Using SPARQL to access Linked Open Data.

Finally, let's use wptools again to get all the data we might ever want about this plant.

The difference this time is that we looked up the Wikidata ID first, using the unique GBIF ID, so we know we will get the info from the correct entity.



Enrich the Original Data

Let's take a moment to consider the journey we have travelled.

Catbells Northern Ascent, Lake District - June 2009

We could do many things with extra information like this:

I'm sure you can think of more ideas!

Add New XML Markup for Named Entities

To finish our exploration in code of this topic, I will show you a proof-of-concept for how we could add new TEI markup to an original Henslow Correspondence Project letter. I have had to make some simplifications in the example for the sake of brevity.

Lobelia urens (spike)

Heath lobelia close to Brigueuil, Charente, France

First, we will go back to the beginning of our journey and get the original letter where the binomial "Lobelia urens" appears. We can search the XML for the named entity and wrap it in a new XML tag to mark its position.

Can you see where we have added the new tag wrapping the named entity?

Now we want to modify this markup with the linked data we collected earlier, as follows:

<name type="taxon" ref="https://www.gbif.org/species/5408353 https://www.wikidata.org/wiki/Q3257667">Lobelia urens</name>

(Thanks to Huw Jones for supplying the correct TEI form to follow.)

We can create the new markup using BeautifulSoup:

And then place it into the XML:

Finally, we can save the new TEI document to file:
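Putting those three steps together, a proof-of-concept sketch (the TEI snippet is a stand-in rather than the real letter, and the output path is illustrative):

```python
from bs4 import BeautifulSoup, NavigableString

# Stand-in snippet for the letter's TEI markup (not the real letter text).
letter_xml = "<TEI><text><p>I have found Lobelia urens in flower.</p></text></TEI>"
soup = BeautifulSoup(letter_xml, "xml")

# 1. Build the new <name> element carrying the linked-data references.
name_tag = soup.new_tag(
    "name",
    type="taxon",
    ref="https://www.gbif.org/species/5408353 https://www.wikidata.org/wiki/Q3257667",
)
name_tag.string = "Lobelia urens"

# 2. Find the text node containing the binomial and splice the tag into it.
node = soup.find(string=lambda s: s and "Lobelia urens" in s)
before, _, after = str(node).partition("Lobelia urens")
parent, position = node.parent, node.parent.contents.index(node)
node.extract()
for piece in (NavigableString(after), name_tag, NavigableString(before)):
    parent.insert(position, piece)

# 3. Save the modified TEI (path is illustrative).
# with open("output/letters_14-taxon.xml", "w", encoding="utf-8") as f:
#     f.write(str(soup))
print(soup.p)
```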

EXERCISE: Review the modified TEI file output/letters_14-taxon.xml and inspect the newly added markup. You may need to download it and open it in Oxygen or another editor to see the markup.

Of course Linked Open Data works both ways: once you have gone to the trouble of linking everything to its Wikidata ID, you may wish to add your data to Wikidata, but that is a big topic for another day.



One final question may have occurred to you during the process of working through this notebook.

If linking is done automatically, potentially without human intervention, how can we be sure the results are accurate?

It's likely in a real-world project you would need some form of human quality control, but an additional approach is to use machine learning to predict links.

There are potentially two ways of doing this:

  1. Build your own entity linker with machine learning.

spaCy has the capability to link named entities to identifiers stored in a knowledge base. For anyone with a lot of computing power and time to hand, there's even some example code to do this with Wikipedia and Wikidata data dumps.

EDSAC II, 10th May 1960, user queue. Copyright Computer Laboratory, University of Cambridge. Reproduced by permission. Creative Commons Attribution 2.0 UK: England & Wales.

The queue for computing time on the Cambridge EDSAC, 1960. Nothing has really changed for users of High Performance Computing today, except that the queue itself is now managed by software!

  2. Use someone else's pre-built entity linker if they have built something suitable for your use case.

You can check spaCy Universe for resources developed with or for spaCy. One example of an entity linker:



Summary

Covered in this notebook:

Congratulations! 🎉

That's the end of this series of notebooks about named entity recognition. I hope you enjoyed your time working through them.